<?php
/**
 * <https://y.st./>
 * Copyright © 2015 Alex Yst <mailto:copyright@y.st>
 * 
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 * 
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 * 
 * You should have received a copy of the GNU General Public License
 * along with this program. If not, see <https://www.gnu.org./licenses/>.
**/

$xhtml = array(
	'<{title}>' => 'The spider&apos;s first run',
	'<{body}>' => <<<END
<p>
	For my new search engine to be at all effective, it needs to know how to handle relative $a[URI]s.
	It obviously cannot request pages with them directly, so I built a function that takes a base $a[URI] and a relative $a[URI] and merges them to form a new absolute $a[URI].
	I found that $a[PHP]&apos;s <a href="https://php.net/manual/en/function.parse-url.php"><code>parse_url()</code> function</a> was very helpful for breaking $a[URI]s into their components so that they could be merged.
	However, all accounting for <code>.</code> and <code>..</code> directories, as well as all processing of the $a[URI] components to form the new absolute $a[URI], had to be coded by hand.
	It took several hours to get it right, but I think that my new function does what I need it to now.
</p>
<p>
	My next task was to find a way to find hyperlinks in a downloaded page so that the $a[URI]s can be collected, normalized, and added to the database.
	I thought that this task would be more difficult than the task of normalizing relative $a[URI]s, but I was pleasantly surprised.
	$a[PHP]&apos;s <a href="https://php.net/manual/en/function.xml-parse-into-struct.php"><code>xml_parse_into_struct()</code> function</a> performs most of the leg work.
	This function&apos;s output will also make it easy to take into account the page&apos;s preferred base base $a[URI], a feature that I wanted to add way later, but will now be able to build very early on.
	I was worried that this would only work on $a[XHTML] pages, as $a[HTML] is not $a[XML]-compliant, but it also seems to work on the few $a[HTML] pages that I tested it on.
	Like the curl_*() functions, the xml_*() functions require passing around a resource handle, so I wrapped them up in a class as well.
</p>
<p>
	I quickly found a strange issue with the new class though.
	Each object instantiated from it can only be used to parse a single $a[XML] document.
	I do not think that this is an error in the class, but rather, an issue with the underlying $a[PHP] functions, and in fact, I witness the same behavior when I remove my wrapper class.
</p>
<p>
	I performed my first trial run of the search engine&apos;s spider today, and at first, it seemed to be doing very well.
	However, it found a large file that someone had linked to and it got stuck there.
	I think that it ate all the memory of my machine, too, as my machine ended up locking up entirely after a couple hours of being stuck on this one file.
	I considered screening the Content-Type headers of files before downloading them, but the particular file that clogged up the spider claims to be of type <code>text/plain; charset=UTF-8</code>.
	It was not a text file though, it was an XZ compressed file.
	Headers cannot always be trusted, as servers can be misconfigured.
	However, I think that the real threat is not misconfigured servers, but maliciously-configured servers.
	I should not rely on headers for anything as important as keeping the spider unclogged.
	There does not seem to be a direct way to limit file download size, but someone on $a[IRC] gave me <a href="https://stackoverflow.com/questions/17641073/how-to-set-a-maximum-size-limit-to-php-curl-downloads">a hint</a> as to how to set a download limit in a less direct way.
	It seems a little confusing to me, potentially because it has been a long day, so I will try working with this again tomorrow.
</p>
<p>
	My <a href="/a/canary.txt">canary</a> still sings the tune of freedom and transparency.
</p>
END
);
